AITopics

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.67)
Education (0.46)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.48)

arXiv.org Machine LearningJun-5-2026

Adaptive Learning Rates with Surrogate Probability for Follow-the-Perturbed-Leader

Lee, Jongyeong, Honda, Junya, Ito, Shinji, Kim, Chansoo

Follow-the-regularized-leader framework has shown effectiveness and flexibility in online learning problems, where the choice of learning rates are known to be crucial. Recently, adaptive learning rates defined in terms of the arm-selection probabilities, obtained by solving convex optimization, have achieved improved best-of-both-worlds (BOBW) guarantees in various bandit problems. In contrast, BOBW guarantees for its computationally efficient alternative, follow-the-perturbed-leader (FTPL), remain relatively limited since its optimization-free nature ironically makes the design of adaptive, probability-dependent learning rates non-trivial. To address this challenge, we propose an adaptive learning rate for FTPL by introducing surrogate probability functions that can be computed only from the available quantities, without requiring the exact probabilities. Based on these learning rates with surrogate functions, we provide the BOBW guarantee for FTPL with Pareto perturbations for any shape parameter $α>1$, generalizing prior results restricted to specific choices of $α=2$. We further show the BOBW guarantees for FTPL with adaptive learning rates in the bandit problem with expert advices. Our approach preserves the computational simplicity of FTPL while enabling probability-dependent adaptivity, and the surrogate-based methodology may be of independent interest in other algorithmic frameworks beyond FTPL and learning rate designs.

bobw guarantee, data mining, machine learning, (19 more...)

2606.06043

Country: Europe (0.28)

Genre: Research Report (0.64)

Industry: Education (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.87)

arXiv.org Machine LearningJun-2-2026

MINTS: Minimalist Thompson Sampling

Wang, Kaizheng

The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the location of the optimum, while eliminating nuisance parameters through profile likelihood. This yields a generalized posterior that naturally accommodates structural constraints. As a direct instantiation, we develop MINimalist Thompson Sampling (MINTS). For multi-armed bandits with mean constraints, we establish near-optimal non-asymptotic regret guarantees and sharp almost-sure asymptotic regret characterizations. In particular, MINTS attains the classical Lai--Robbins constant in the unstructured setting and automatically adapts to unimodal structure, achieving the sharp constant determined only by the immediate neighbors of the optimal arm.

bandit, data mining, machine learning, (20 more...)

2606.01655

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

arXiv.org Machine LearningMay-8-2026

Bandit Learning in General Open Multi-agent Systems

Xu, Mengfan

Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new agents make regret scale further with the time horizon. To this end, we formulate a unified open-system bandit problem with general dynamics, including heterogeneous rewards and general agent patterns. We introduce new concepts to capture the inherent complexities: the \emph{pre-training degree} of new agents quantifies how much information an agent carries upon entry, \emph{stability} measures the impact of new agents on the system, and \emph{global dynamic regret} compares the cumulative expected reward of all active agents with that of the varying optimal arms. We develop certified global-UCB learning methodologies with provable guarantees. Our regret bounds reveal that entry uncertainty enters linearly via the pre-training degree, while in stable regimes, regret is governed by the time needed to identify a persistent optimal arm, as well as by the agent patterns. We further show that these dependencies are tight via lower bounds in hard instances.

agent, artificial intelligence, machine learning, (18 more...)

2605.06202

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Industry:

Education (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)

arXiv.org Machine LearningMay-5-2026

On the Optimal Sample Complexity of Offline Multi-Armed Bandits with KL Regularization

Ji, Kaixuan, Di, Qiwei, Zhao, Heyang, Zhao, Qingyue, Gu, Quanquan

Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $\tilde{O}(ηSAC^{π^*}/ε)$ under large regularization $η= \tilde{O}(ε^{-1})$, and a sample complexity of $\tildeΩ(SAC^{π^*}/ε^2)$ under small regularization $η= \tildeΩ(ε^{-1})$, where $η$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^{π^*}$ policy coverage coefficient at the optimal policy $π^*$, $ε$ is the desired sub-optimality, and $\tilde{O}$ and $\tildeΩ$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.

data mining, machine learning, natural language, (18 more...)

2605.02141

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Data Science > Data Mining > Big Data (0.68)

Neural Information Processing SystemsApr-30-2026, 03:53:22 GMT

BanditPAM++: Faster k-medoids Clustering

Clustering is a fundamental task in data science with wide-ranging applications. In k-medoids clustering, cluster centers must be actual datapoints and arbitrary distance metrics may be used; these features allow for greater interpretability of the cluster centers and the clustering of exotic objects in k-medoids clustering, respectively.

data mining, machine learning, natural language, (21 more...)

Country: North America > United States (0.46)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science > Data Mining (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Neural Information Processing SystemsApr-29-2026, 23:33:25 GMT

Statistical and Computational Trade-off in Multi-Agent Multi-Armed Bandits

We study the problem of regret minimization in Multi-Agent Multi-Armed Bandits (MAMABs) where the rewards are defined through a factor graph. We derive an instance-specific regret lower bound and characterize the minimal expected number of times each global action should be explored. This bound and the corresponding optimal exploration process are obtained by solving a combinatorial optimization problem whose set of variables and constraints exponentially grow with the number of agents, and cannot be exploited in the design of efficient algorithms. Inspired by Mean Field approximation techniques used in graphical models, we provide simple upper bounds of the regret lower bound. The corresponding optimization problems have a reduced number of variables and constraints. By tuning the latter, we may explore the trade-off between the achievable regret and the complexity of computing the corresponding exploration process. We devise Efficient Sampling for MAMAB (ESM), an algorithm whose regret asymptotically matches the approximated lower bounds. The regret and computational complexity of ESM are assessed numerically, using both synthetic and real-world experiments in radio communications networks.

algorithm, artificial intelligence, optimization problem, (15 more...)

Country: Europe > Sweden (0.14)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.48)

Neural Information Processing SystemsApr-26-2026, 00:26:03 GMT

Parallelizing Thompson Sampling

How can we make use of information parallelism in online decision making problems while efficiently balancing the exploration-exploitation trade-off? In this paper, we introduce a batch Thompson Sampling framework for two canonical online decision making problems, namely, stochastic multi-arm bandit and linear contextual bandit with finitely many arms. Over a time horizon T, our batch Thompson Sampling policy achieves the same (asymptotic) regret bound of a fully sequential one while carrying out only O(log T) batch queries. To achieve this exponential reduction, i.e., reducing the number of interactions from T to O(log T), our batch policy dynamically determines the duration of each batch in order to balance the exploration-exploitation trade-off. We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocations.

artificial intelligence, data mining, machine learning, (19 more...)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.30)

Neural Information Processing SystemsApr-25-2026, 19:58:43 GMT

Bandit Social Learning under Myopic Behavior

We study social learning dynamics motivated by reviews on online platforms. The agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regards to exploration. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals for the arms' expected rewards. We derive stark exploration failures for any such behavior, and provide matching positive results. As a special case, we obtain the first general results on failure of the greedy algorithm in bandits, thus providing a theoretical foundation for why bandit algorithms should explore.1

artificial intelligence, data mining, machine learning, (20 more...)